Abstract

Most sparse linear representation-based trackers need to
solve a computationally expensive $\ell_1$-regularized optimization
problem. To address this problem, we propose a visual
tracker based on non-sparse linear representations, which 
admit an efficient closed-form solution without sacrificing 
accuracy. Moreover, in order to capture the correlation in- 
formation between different feature dimensions, we learn 
a Mahalanobis distance metric in an online fashion and 
incorporate the learned metric into the optimization prob- 
lem for obtaining the linear representation. We show that 
online metric learning using proximity comparison signif- 
icantly improves the robustness of the tracking, especially 
on those sequences exhibiting drastic appearance changes. 
Furthermore, in order to prevent the unbounded growth 
in the number of training samples for the metric learn- 
ing, we design a time-weighted reservoir sampling method 
to maintain and update limited-sized foreground and back- 
ground sample buffers for balancing sample diversity and 
adaptability. Experimental results on challenging videos 
demonstrate the effectiveness and robustness of the pro- 
posed tracker. 

1. Introduction 

Robust visual tracking is an important problem in com- 
puter vision. In recent years, steady improvements have 
been made to the speed, accuracy and robustness of track- 
ing techniques. A crucial factor in many of these improve- 
ments has been the construction and optimization of object 
appearance models (e.g., [1-9]). Among these models, the 
linear representation, in which the object is represented as 
a linear combination of basis samples, has proved to be a 
simple yet effective choice. For example, Mei and Ling [2] 
propose a tracker based on a sparse linear representation 
which solves an $\ell_1$-regularized optimization problem. With
the sparsity constraint, this tracker obtains a sparse regres- 
sion solution that can adaptively select a small number of 
relevant templates to optimally approximate the given test 
samples. The drawback is its high computational cost, due to
the need to solve an $\ell_1$-norm convex problem. To speed
up the tracking, Li et al. [4] propose to approximately solve 
the sparsity optimization problem using orthogonal match- 
ing pursuit (OMP). Recently, research has revealed that the 
$\ell_1$-norm induced sparsity does not in general help improve
the accuracy of image classification; and non-sparse representation
based methods are typically orders of magnitude
faster than the sparse representation based ones with com- 
petitive and sometimes even better accuracy [10-12]. 

Inspired by these findings, here we propose a non-sparse 
linear representation based visual tracker. The proposed 
tracker can be implemented by solving a least-squares problem,
which admits an extremely simple and efficient closed-form
solution. To date, linear representation based trackers
[2, 4] have built linear regressors that are defined on
independent feature dimensions (mutually independent raw 
pixels in both [2] and [4]). In other words, the correlation 
information between different feature dimensions is not ex- 
ploited. We argue that this correlation information is im- 
portant in tracking. To address this problem, we learn a 
Mahalanobis distance metric and incorporate it into the op- 
timization of the linear representation. 

Metric learning has emerged as a useful tool for many 
applications. For example, in [13, 14], a Mahalanobis dis- 
tance metric is learned using positive semidefinite program- 
ming. Discriminative metric learning has also been success- 
fully applied to visual tracking [15, 16]. These works learn a 
distance metric mainly for object matching across adjacent 
frames, and the tracking is not carried out in the framework 
of linear representations. In this work, we learn a distance 
metric using proximity comparison for linear representa- 
tion based tracking. The learning strategy is adapted from 
the online metric learning for image retrieval of Chechik 
et al. [17]. There, it has been shown that the online learning
procedure is efficient and scales to large learning problems.
Nevertheless, it is not designed for dealing with time-varying
data streams such as those in real-time visual tracking.

Visual tracking is a time-varying process which deals
with a dynamic data stream in an online manner. Due to
memory limits, it is often impractical for trackers to store all
the stream data. Furthermore, visual tracking in the current
frame usually relies more on recently received samples than
old samples due to its temporal coherence property. There- 
fore, it is necessary for trackers to maintain and update 
limited-sized data buffers for balancing between sample di- 
versity and adaptability. To address this issue, we propose 
to use reservoir sampling [18, 19] for sequential random 
sampling. The conventional reservoir sampling in [18, 19]
can only accomplish the task of uniform random sampling,
which ignores the differences in importance among samples. We
therefore need a time-weighted reservoir sampling.

In summary, we propose a robust tracker that is based on 
metric-weighted linear representations and time-weighted 
reservoir sampling. Our main contributions are as follows. 

1. We propose an online discriminative linear representation
for visual tracking. The metric-weighted least-squares
optimization problem admits a closed-form
solution, which significantly improves tracking effi- 
ciency. We also demonstrate that, with the emergence 
of new data, the closed-form solution can be efficiently 
updated by a sequence of simple matrix operations. 

2. To further improve the discriminative capability of 
the linear representation for distinguishing foreground 
and background, we present an online Mahalanobis 
distance metric learning method and incorporate the 
learned metric into the optimization problem for ob- 
taining a discriminative linear representation. The 
learned metric can effectively capture the correla- 
tion information between different feature dimensions. 
Such correlation information plays an important role 
in robust object/non-object classification. 

3. To allow for real-time applications, we design a time- 
weighted reservoir sampling method to maintain and 
update limited-sized sample buffers for balancing be- 
tween sample diversity and adaptability in the metric 
learning procedure. Following [20, 21], larger
weights are assigned to those recently received sam- 
ples, which is particularly important for tracking. To 
our knowledge, it is the first time that reservoir sam- 
pling is used in an online metric learning setting that 
is tailored for robust visual tracking. 

2. The proposed visual tracker 

In this section, we describe the novel aspects of the pro- 
posed visual tracker: 

1. Object state estimation. This is implemented by an 
online metric-weighted optimization, as described in 
Section 2.1; 

2. Metric update using the online metric-weighted opti- 
mization in response to changing foreground and back- 
ground, as described in Section 2.2; 

3. Sample update used for object representation based on 
reservoir sampling, as described in Section 2.3. 



2.1. Online metric-weighted linear representation 

To effectively characterize dynamic appearance varia- 
tions during tracking, an object is associated with an ap- 
pearance subspace spanned by a set of basis samples, which 
encode the distribution of the object appearance. Therefore, 
the problem of visual tracking is converted to that of linear 
representation and reconstruction. As a result, the sample- 
to-subspace distance (e.g., linear reconstruction error) can 
be used for evaluating the likelihood of a test sample be- 
longing to the object appearance. However, the conven- 
tional linear representations (e.g., used in [2,4]) ignore the 
correlation information between feature dimensions. Due 
to the influence of complicated appearance variations, the 
correlation across feature dimensions usually differs greatly 
during tracking. In order to address this problem, we pro- 
pose a metric-weighted linear representation based on solv- 
ing a metric-weighted optimization problem under a learned 
distance metric. Consequently, the proposed linear repre- 
sentation is capable of capturing the varying correlation in- 
formation between feature dimensions. 

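More specifically, given a set of basis samples
$\mathbf{P} = (\mathbf{p}_i)_{i=1}^{N} \in \mathbb{R}^{d \times N}$ and a test sample
$\mathbf{y} \in \mathbb{R}^{d \times 1}$, we aim
to discover a linear combination of $\mathbf{P}$ to optimally approximate
the test sample $\mathbf{y}$ by solving the following optimization
problem:

$$\min_{\mathbf{x}} g(\mathbf{x}; \mathbf{M}, \mathbf{P}, \mathbf{y}) = \min_{\mathbf{x}} (\mathbf{y} - \mathbf{P}\mathbf{x})^{T} \mathbf{M} (\mathbf{y} - \mathbf{P}\mathbf{x}), \tag{1}$$

where $\mathbf{x} \in \mathbb{R}^{N \times 1}$ and $\mathbf{M}$ is a symmetric distance metric
matrix. The optimization problem (1) is a weighted linear
regression problem whose analytical solution can be directly
computed as:

$$\mathbf{x}^{*} = (\mathbf{P}^{T}\mathbf{M}\mathbf{P})^{-1}\mathbf{P}^{T}\mathbf{M}\mathbf{y}. \tag{2}$$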
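The closed-form solution in Equ. (2) can be illustrated with a minimal NumPy sketch. The function name `metric_weighted_solve`, the random test data, and the way the metric is generated are assumptions made for illustration only; they are not taken from the paper.

```python
import numpy as np

def metric_weighted_solve(P, M, y):
    """Return x* = (P^T M P)^{-1} P^T M y and the objective value g(x*)."""
    A = P.T @ M @ P                      # N x N matrix P^T M P
    b = P.T @ M @ y                      # right-hand side P^T M y
    try:
        x = np.linalg.solve(A, b)
    except np.linalg.LinAlgError:
        x = np.linalg.pinv(A) @ b        # singular case: fall back to the pseudoinverse
    r = y - P @ x                        # reconstruction residual
    return x, float(r @ M @ r)

# Illustrative data (hypothetical sizes and values).
rng = np.random.default_rng(0)
d, N = 8, 3
P = rng.standard_normal((d, N))
A0 = rng.standard_normal((d, d))
M = A0 @ A0.T + np.eye(d)                # a symmetric positive definite metric
y = rng.standard_normal(d)

x_star, g_star = metric_weighted_solve(P, M, y)
```

At the optimum the metric-weighted normal equations hold, i.e. $\mathbf{P}^{T}\mathbf{M}(\mathbf{y} - \mathbf{P}\mathbf{x}^{*}) = \mathbf{0}$, and with $\mathbf{M} = \mathbf{I}$ the result reduces to ordinary least squares.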

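If $\mathbf{P}^{T}\mathbf{M}\mathbf{P}$ is a singular matrix, we directly use its
pseudoinverse to compute $\mathbf{x}^{*}$. The main computational time
of Equ. (2) is spent on the calculation of $(\mathbf{P}^{T}\mathbf{M}\mathbf{P})^{-1}$.
For computational efficiency, we need to incrementally or
decrementally update the inverse when $\mathbf{P}$ is expanded or
reduced by one column under the same metric $\mathbf{M}$. Let
$\mathbf{P}_n = (\mathbf{P} \;\; \Delta\mathbf{p})$ denote the expanded matrix of $\mathbf{P}$. Clearly,
the following relation holds:

$$(\mathbf{P}_n)^{T}\mathbf{M}\mathbf{P}_n = \begin{pmatrix} \mathbf{P}^{T}\mathbf{M}\mathbf{P} & \mathbf{P}^{T}\mathbf{M}\Delta\mathbf{p} \\ (\Delta\mathbf{p})^{T}\mathbf{M}\mathbf{P} & (\Delta\mathbf{p})^{T}\mathbf{M}\Delta\mathbf{p} \end{pmatrix}.$$

For simplicity, let $\mathbf{H} = (\mathbf{P}^{T}\mathbf{M}\mathbf{P})^{-1}$, $\mathbf{c} = \mathbf{P}^{T}\mathbf{M}\Delta\mathbf{p}$,
and $r = (\Delta\mathbf{p})^{T}\mathbf{M}\Delta\mathbf{p}$. Since $\mathbf{M}$ is a symmetric matrix,
$\mathbf{c}^{T} = (\Delta\mathbf{p})^{T}\mathbf{M}\mathbf{P}$. According to the theory of matrix
computation [22], the corresponding inverse of $(\mathbf{P}_n)^{T}\mathbf{M}\mathbf{P}_n$ can
be computed as:

$$\left((\mathbf{P}_n)^{T}\mathbf{M}\mathbf{P}_n\right)^{-1} = \begin{pmatrix} \mathbf{H} + \dfrac{\mathbf{H}\mathbf{c}\mathbf{c}^{T}\mathbf{H}}{r - \mathbf{c}^{T}\mathbf{H}\mathbf{c}} & -\dfrac{\mathbf{H}\mathbf{c}}{r - \mathbf{c}^{T}\mathbf{H}\mathbf{c}} \\ -\dfrac{\mathbf{c}^{T}\mathbf{H}}{r - \mathbf{c}^{T}\mathbf{H}\mathbf{c}} & \dfrac{1}{r - \mathbf{c}^{T}\mathbf{H}\mathbf{c}} \end{pmatrix}. \tag{3}$$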
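The incremental update of Equ. (3) can be checked numerically against a direct inversion. The following is a minimal sketch; the helper name `expand_inverse` and the random test data are illustrative assumptions.

```python
import numpy as np

def expand_inverse(H, P, M, dp):
    """Return ((P dp)^T M (P dp))^{-1} from H = (P^T M P)^{-1}, following Equ. (3)."""
    c = P.T @ M @ dp                     # c = P^T M dp
    r = float(dp @ M @ dp)               # r = dp^T M dp
    Hc = H @ c
    s = r - float(c @ Hc)                # scalar r - c^T H c
    n = H.shape[0]
    H_new = np.empty((n + 1, n + 1))
    # H is symmetric, so (Hc)(Hc)^T equals H c c^T H.
    H_new[:n, :n] = H + np.outer(Hc, Hc) / s
    H_new[:n, n] = -Hc / s
    H_new[n, :n] = -Hc / s
    H_new[n, n] = 1.0 / s
    return H_new

# Illustrative data (hypothetical sizes and values).
rng = np.random.default_rng(0)
d, N = 10, 4
P = rng.standard_normal((d, N))
A0 = rng.standard_normal((d, d))
M = A0 @ A0.T + np.eye(d)                # symmetric positive definite metric
dp = rng.standard_normal(d)              # new column appended to P

H = np.linalg.inv(P.T @ M @ P)
H_inc = expand_inverse(H, P, M, dp)
P_n = np.column_stack([P, dp])
H_direct = np.linalg.inv(P_n.T @ M @ P_n)
```

The update only needs matrix-vector products with the existing $N \times N$ inverse, whereas the direct route re-inverts an $(N+1) \times (N+1)$ matrix.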
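Similarly, let $\mathbf{P}_o$ denote the reduced matrix of $\mathbf{P}$ after
removing the $i$-th column such that $1 \le i \le N$. Based
on [22], the corresponding inverse of $(\mathbf{P}_o)^{T}\mathbf{M}\mathbf{P}_o$ can be
computed as:

$$\left((\mathbf{P}_o)^{T}\mathbf{M}\mathbf{P}_o\right)^{-1} = \mathbf{H}(\mathcal{I}_i, \mathcal{I}_i) - \frac{\mathbf{H}(\mathcal{I}_i, i)\,\mathbf{H}(i, \mathcal{I}_i)}{\mathbf{H}(i, i)}, \tag{4}$$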

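where $\mathcal{I}_i = \{1, 2, \ldots, N\} \setminus \{i\}$ stands for the index set
excluding $i$. For adapting to object appearance changes, it is
necessary for trackers to replace an old sample from the
sample buffer with a new sample. In essence, the replacement
operation can be decomposed into two stages: 1) old
sample removal; and 2) new sample arrival. As a matter of
fact, 1) and 2) correspond to the decremental and incremental
cases, respectively. Given $\mathbf{H} = (\mathbf{P}^{T}\mathbf{M}\mathbf{P})^{-1}$, we first
compute the decremental inverse $((\mathbf{P}_o)^{T}\mathbf{M}\mathbf{P}_o)^{-1}$ according
to Equ. (4), and then calculate the incremental inverse
$((\mathbf{P}_o \;\; \Delta\mathbf{p})^{T}\mathbf{M}(\mathbf{P}_o \;\; \Delta\mathbf{p}))^{-1}$ using Equ. (3). For notational
simplicity, we let $\mathbf{P}' = (\mathbf{P}_o \;\; \Delta\mathbf{p})$, $\mathbf{H}_o = ((\mathbf{P}_o)^{T}\mathbf{M}\mathbf{P}_o)^{-1}$,
$\mathbf{c}_o = (\mathbf{P}_o)^{T}\mathbf{M}\Delta\mathbf{p}$, and $r = (\Delta\mathbf{p})^{T}\mathbf{M}\Delta\mathbf{p}$. Based on
Equ. (3), $((\mathbf{P}')^{T}\mathbf{M}\mathbf{P}')^{-1}$ can be computed as:

$$\left((\mathbf{P}')^{T}\mathbf{M}\mathbf{P}'\right)^{-1} = \begin{pmatrix} \mathbf{H}_o + \dfrac{\mathbf{H}_o\mathbf{c}_o\mathbf{c}_o^{T}\mathbf{H}_o}{r - \mathbf{c}_o^{T}\mathbf{H}_o\mathbf{c}_o} & -\dfrac{\mathbf{H}_o\mathbf{c}_o}{r - \mathbf{c}_o^{T}\mathbf{H}_o\mathbf{c}_o} \\ -\dfrac{\mathbf{c}_o^{T}\mathbf{H}_o}{r - \mathbf{c}_o^{T}\mathbf{H}_o\mathbf{c}_o} & \dfrac{1}{r - \mathbf{c}_o^{T}\mathbf{H}_o\mathbf{c}_o} \end{pmatrix}. \tag{5}$$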
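The decremental step of Equ. (4) and the replacement step of Equ. (5) can be sketched as follows. All function names are hypothetical, indices are 0-based (unlike the 1-based indices in the text), and the new sample is appended as the last column, which is an assumption about the buffer layout.

```python
import numpy as np

def remove_column_inverse(H, i):
    """Equ. (4): inverse of P_o^T M P_o after deleting column i of P (0-based)."""
    keep = [j for j in range(H.shape[0]) if j != i]
    return H[np.ix_(keep, keep)] - np.outer(H[keep, i], H[i, keep]) / H[i, i]

def append_column_inverse(H, P, M, dp):
    """Equ. (3): inverse after appending the column dp to P."""
    c = P.T @ M @ dp
    Hc = H @ c
    s = float(dp @ M @ dp) - float(c @ Hc)
    top = np.hstack([H + np.outer(Hc, Hc) / s, (-Hc / s)[:, None]])
    bottom = np.append(-Hc / s, 1.0 / s)
    return np.vstack([top, bottom])

def replace_column_inverse(H, P, M, i, dp):
    """Equ. (5): remove column i, then append dp; returns the new inverse and P'."""
    H_o = remove_column_inverse(H, i)
    P_o = np.delete(P, i, axis=1)
    return append_column_inverse(H_o, P_o, M, dp), np.column_stack([P_o, dp])

# Illustrative data (hypothetical sizes and values).
rng = np.random.default_rng(1)
d, N = 10, 5
P = rng.standard_normal((d, N))
A0 = rng.standard_normal((d, d))
M = A0 @ A0.T + np.eye(d)
dp = rng.standard_normal(d)

H = np.linalg.inv(P.T @ M @ P)
H_dec = remove_column_inverse(H, 1)
H_rep, P_new = replace_column_inverse(H, P, M, 1, dp)
```

Both results can be compared against `np.linalg.inv` applied to the reduced or replaced basis to confirm the formulas.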

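Furthermore, when updated according to Algorithm 2, $\mathbf{M}$
is modified by rank-one additions such that
$\mathbf{M} \leftarrow \mathbf{M} + \eta(\mathbf{a}_{-}\mathbf{a}_{-}^{T} - \mathbf{a}_{+}\mathbf{a}_{+}^{T})$, where
$\mathbf{a}_{+} = \mathbf{p} - \mathbf{p}^{+}$ and $\mathbf{a}_{-} = \mathbf{p} - \mathbf{p}^{-}$
are two vectors (defined in Equ. (14)) for triplet construction,
and $\eta$ is a step-size factor (defined in Equ. (21)).
As a result, the original $\mathbf{P}^{T}\mathbf{M}\mathbf{P}$ becomes
$\mathbf{P}^{T}\mathbf{M}\mathbf{P} + (\eta\mathbf{P}^{T}\mathbf{a}_{-})(\mathbf{P}^{T}\mathbf{a}_{-})^{T} + (-\eta\mathbf{P}^{T}\mathbf{a}_{+})(\mathbf{P}^{T}\mathbf{a}_{+})^{T}$. When $\mathbf{M}$ is
modified by a rank-one addition, the inverse of $\mathbf{P}^{T}\mathbf{M}\mathbf{P}$ can
be easily updated according to the theory of [23,24], applied
once for each of the two rank-one terms:

$$(\mathbf{J} + \mathbf{u}\mathbf{v}^{T})^{-1} = \mathbf{J}^{-1} - \frac{\mathbf{J}^{-1}\mathbf{u}\mathbf{v}^{T}\mathbf{J}^{-1}}{1 + \mathbf{v}^{T}\mathbf{J}^{-1}\mathbf{u}}. \tag{6}$$

Here, $\mathbf{J} = \mathbf{P}^{T}\mathbf{M}\mathbf{P}$, $\mathbf{u} = \eta\mathbf{P}^{T}\mathbf{a}_{-}$ (or $\mathbf{u} = -\eta\mathbf{P}^{T}\mathbf{a}_{+}$), and
$\mathbf{v} = \mathbf{P}^{T}\mathbf{a}_{-}$ (or $\mathbf{v} = \mathbf{P}^{T}\mathbf{a}_{+}$). The complete procedure of
online linear optimization under the metric $\mathbf{M}$ is summarized
in Algorithm 1.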
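To make the use of Equ. (6) concrete, the sketch below applies the rank-one inverse update twice, once per term of the metric step, and compares the result with a direct inversion. The function names, the step size `eta`, and the random data are illustrative assumptions.

```python
import numpy as np

def sherman_morrison(J_inv, u, v):
    """Equ. (6): inverse of (J + u v^T) given J^{-1}."""
    Ju = J_inv @ u
    vJ = v @ J_inv
    return J_inv - np.outer(Ju, vJ) / (1.0 + float(v @ Ju))

def update_inverse_for_metric_step(H, P, a_plus, a_minus, eta):
    """Update H = (P^T M P)^{-1} after M <- M + eta (a_- a_-^T - a_+ a_+^T)."""
    v_minus = P.T @ a_minus
    H = sherman_morrison(H, eta * v_minus, v_minus)     # + eta (P^T a_-)(P^T a_-)^T
    v_plus = P.T @ a_plus
    H = sherman_morrison(H, -eta * v_plus, v_plus)      # - eta (P^T a_+)(P^T a_+)^T
    return H

# Illustrative data (hypothetical sizes and values).
rng = np.random.default_rng(2)
d, N = 10, 4
P = rng.standard_normal((d, N))
A0 = rng.standard_normal((d, d))
M = A0 @ A0.T + np.eye(d)
a_plus = rng.standard_normal(d)
a_minus = rng.standard_normal(d)
eta = 0.01                                # small illustrative step size

H = np.linalg.inv(P.T @ M @ P)
H_upd = update_inverse_for_metric_step(H, P, a_plus, a_minus, eta)
M_new = M + eta * (np.outer(a_minus, a_minus) - np.outer(a_plus, a_plus))
H_direct = np.linalg.inv(P.T @ M_new @ P)
```

Each call costs $O(N^2)$ operations on the $N \times N$ inverse, so a metric step does not require re-inverting $\mathbf{P}^{T}\mathbf{M}\mathbf{P}$ from scratch.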

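Furthermore, visual tracking is typically posed as a binary
classification problem. To address this problem, we
need to simultaneously optimize the following two objective
functions: $\mathbf{x}_f^{*} = \arg\min_{\mathbf{x}_f} g(\mathbf{x}_f; \mathbf{M}, \mathbf{P}_f, \mathbf{y})$ and
$\mathbf{x}_b^{*} = \arg\min_{\mathbf{x}_b} g(\mathbf{x}_b; \mathbf{M}, \mathbf{P}_b, \mathbf{y})$, where $\mathbf{P}_f$ and $\mathbf{P}_b$ are
foreground and background basis samples, respectively. Thus,
we can define a discriminative criterion for measuring the
similarity of the test sample $\mathbf{y}$ belonging to the foreground
class:

$$S(\mathbf{y}) = \sigma\left[\exp(-\theta_f/\gamma_f) - \rho\exp(-\theta_b/\gamma_b)\right], \tag{7}$$

where $\gamma_f$ and $\gamma_b$ are two scaling factors, $\theta_f = g(\mathbf{x}_f^{*}; \mathbf{M}, \mathbf{P}_f, \mathbf{y})$,
$\theta_b = g(\mathbf{x}_b^{*}; \mathbf{M}, \mathbf{P}_b, \mathbf{y})$, $\rho$ is a trade-off control
factor, and $\sigma[\cdot]$ is the sigmoid function.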
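The criterion in Equ. (7) can be sketched as below. The function names and the synthetic foreground and background bases are assumptions for illustration; the default values $\gamma_f = \gamma_b = 1$ and $\rho = 0.1$ follow the settings reported in the experiments.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def reconstruction_error(P, M, y):
    """g(x*; M, P, y): metric-weighted reconstruction error at the optimum of Equ. (1)."""
    x = np.linalg.pinv(P.T @ M @ P) @ (P.T @ M @ y)
    r = y - P @ x
    return float(r @ M @ r)

def foreground_score(y, P_f, P_b, M, gamma_f=1.0, gamma_b=1.0, rho=0.1):
    """Equ. (7): S(y) = sigmoid(exp(-theta_f/gamma_f) - rho * exp(-theta_b/gamma_b))."""
    theta_f = reconstruction_error(P_f, M, y)
    theta_b = reconstruction_error(P_b, M, y)
    return float(sigmoid(np.exp(-theta_f / gamma_f) - rho * np.exp(-theta_b / gamma_b)))

# Illustrative data: random bases and an identity metric (hypothetical values).
rng = np.random.default_rng(3)
d = 8
P_f = rng.standard_normal((d, 3))
P_b = rng.standard_normal((d, 3))
M = np.eye(d)

y_fg = P_f @ np.array([0.5, 0.3, 0.2])    # lies in the foreground subspace
y_bg = P_b @ np.array([0.5, 0.3, 0.2])    # lies in the background subspace
s_fg = foreground_score(y_fg, P_f, P_b, M)
s_bg = foreground_score(y_bg, P_f, P_b, M)
```

A sample that is well reconstructed by the foreground basis and poorly by the background basis receives the higher score, which is the behavior the observation model relies on.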



Algorithm 1 Metric-weighted linear representation 

Input: The current distance metric matrix $\mathbf{M}$, the basis samples
$\mathbf{P} = (\mathbf{p}_i)_{i=1}^{N} \in \mathbb{R}^{d \times N}$, any test sample $\mathbf{y} \in \mathbb{R}^{d \times 1}$.
Output: The optimal linear representation solution $\mathbf{x}^{*}$.

1. Build the optimization problem in Equ. (1):

$\min_{\mathbf{x}} g(\mathbf{x}; \mathbf{M}, \mathbf{P}, \mathbf{y}) = \min_{\mathbf{x}} (\mathbf{y} - \mathbf{P}\mathbf{x})^{T}\mathbf{M}(\mathbf{y} - \mathbf{P}\mathbf{x})$

2. Compute the optimal solution $\mathbf{x}^{*} = (\mathbf{P}^{T}\mathbf{M}\mathbf{P})^{-1}\mathbf{P}^{T}\mathbf{M}\mathbf{y}$. When $\mathbf{P}$
is expanded, reduced, or replaced by one column, the corresponding
computation of $(\mathbf{P}^{T}\mathbf{M}\mathbf{P})^{-1}$ can be efficiently accomplished in an
online manner:

• Use Equ. (3) to compute the incremental inverse.

• Employ Equ. (4) to calculate the decremental inverse.

• Utilize Equ. (5) to obtain the replacement inverse.

3. Update the inverse of $\mathbf{P}^{T}\mathbf{M}\mathbf{P}$ by Equ. (6) when $\mathbf{M}$ is modified as a
rank-one addition in Algorithm 2, and then repeat Steps 1 and 2.

4. Return the optimal solution $\mathbf{x}^{*}$.

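Algorithm 2 Online distance metric learning using triplets

Input: The current distance metric matrix $\mathbf{M}^{k}$ and a new triplet
$(\mathbf{p}, \mathbf{p}^{+}, \mathbf{p}^{-})$.
Output: The updated distance metric matrix $\mathbf{M}^{k+1}$.

1. Calculate $\mathbf{a}_{+} = \mathbf{p} - \mathbf{p}^{+}$ and $\mathbf{a}_{-} = \mathbf{p} - \mathbf{p}^{-}$.

2. Compute the optimal step length $\eta$ that is formulated as:
$\eta = \min\left\{C, \max\left\{0, \dfrac{1 + \mathbf{a}_{+}^{T}\mathbf{M}^{k}\mathbf{a}_{+} - \mathbf{a}_{-}^{T}\mathbf{M}^{k}\mathbf{a}_{-}}{2\mathbf{a}_{-}^{T}\mathbf{U}\mathbf{a}_{-} - 2\mathbf{a}_{+}^{T}\mathbf{U}\mathbf{a}_{+} - \|\mathbf{U}\|_F^{2}}\right\}\right\}$
with $\mathbf{U}$ being $\mathbf{a}_{-}\mathbf{a}_{-}^{T} - \mathbf{a}_{+}\mathbf{a}_{+}^{T}$.

3. $\mathbf{M}^{k+1} \leftarrow \mathbf{M}^{k} + \eta(\mathbf{a}_{-}\mathbf{a}_{-}^{T} - \mathbf{a}_{+}\mathbf{a}_{+}^{T})$.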



2.2. Online metric learning using proximity com- 
parison 

To efficiently compute the linear representation solution 
in Equ. (2), we need to update the quadratic Mahalanobis 
distance metric in an online manner. Motivated by this, we 
propose an online metric learning scheme by solving a max- 
margin optimization problem using triplets. 

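Suppose that we have a set of triplets $\{(\mathbf{p}, \mathbf{p}^{+}, \mathbf{p}^{-})\}$ with
$\mathbf{p}, \mathbf{p}^{+}, \mathbf{p}^{-} \in \mathbb{R}^{d}$. These triplets encode the proximity
comparison information. Without loss of generality, let us
assume that the distance between $\mathbf{p}$ and $\mathbf{p}^{+}$ is smaller than
the distance between $\mathbf{p}$ and $\mathbf{p}^{-}$.

The Mahalanobis distance under metric $\mathbf{M}$ is defined as:

$$D_{\mathbf{M}}(\mathbf{p}, \mathbf{q}) = (\mathbf{p} - \mathbf{q})^{T}\mathbf{M}(\mathbf{p} - \mathbf{q}). \tag{8}$$

Clearly, $\mathbf{M}$ must be a symmetric and positive semidefinite
matrix. It is equivalent to learn a projection matrix $\mathbf{L}$ such
that $\mathbf{M} = \mathbf{L}\mathbf{L}^{T}$. In practice, we generate the triplet set such that
$\mathbf{p}$ and $\mathbf{p}^{+}$ belong to the same class and $\mathbf{p}$ and $\mathbf{p}^{-}$ belong to
different classes. So we want the constraints $D_{\mathbf{M}}(\mathbf{p}, \mathbf{p}^{+}) <
D_{\mathbf{M}}(\mathbf{p}, \mathbf{p}^{-})$ to be satisfied as well as possible. By putting it
into a large-margin learning framework, and using the soft-margin
hinge loss, the loss function for each triplet is:

$$l_{\mathbf{M}}(\mathbf{p}, \mathbf{p}^{+}, \mathbf{p}^{-}) = \max\{0,\, 1 + D_{\mathbf{M}}(\mathbf{p}, \mathbf{p}^{+}) - D_{\mathbf{M}}(\mathbf{p}, \mathbf{p}^{-})\}. \tag{9}$$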

To obtain the optimal distance metric matrix $\mathbf{M}$, we need
to minimize the global loss $L_{\mathbf{M}}$ that takes the sum of hinge
losses (9) over all possible triplets from the training set:

$$L_{\mathbf{M}} = \sum_{(\mathbf{p}, \mathbf{p}^{+}, \mathbf{p}^{-}) \in Q} l_{\mathbf{M}}(\mathbf{p}, \mathbf{p}^{+}, \mathbf{p}^{-}), \tag{10}$$

where $Q$ is the triplet set. To sequentially optimize the
above objective function $L_{\mathbf{M}}$ in an online fashion, we design
an iterative algorithm to solve the following convex
problem:

$$\mathbf{M}^{k+1} = \arg\min_{\mathbf{M}} \frac{1}{2}\|\mathbf{M} - \mathbf{M}^{k}\|_F^{2} + C\xi, \quad \text{s.t. } D_{\mathbf{M}}(\mathbf{p}, \mathbf{p}^{-}) - D_{\mathbf{M}}(\mathbf{p}, \mathbf{p}^{+}) \ge 1 - \xi,\; \xi \ge 0, \tag{11}$$

where $\|\cdot\|_F$ denotes the Frobenius norm, $\xi$ is a slack variable,
and $C$ is a positive factor controlling the trade-off between
the smoothness term $\frac{1}{2}\|\mathbf{M} - \mathbf{M}^{k}\|_F^{2}$ and the loss
term $\xi$. According to the passive-aggressive mechanism
used in [17,25], we only update the metric matrix $\mathbf{M}$ when
$l_{\mathbf{M}}(\mathbf{p}, \mathbf{p}^{+}, \mathbf{p}^{-}) > 0$.

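Subsequently, we derive an optimization function with
Lagrangian regularization:

$$\mathcal{L}(\mathbf{M}, \eta, \xi, \beta) = \frac{1}{2}\|\mathbf{M} - \mathbf{M}^{k}\|_F^{2} + C\xi - \beta\xi + \eta\left(1 - \xi + D_{\mathbf{M}}(\mathbf{p}, \mathbf{p}^{+}) - D_{\mathbf{M}}(\mathbf{p}, \mathbf{p}^{-})\right), \tag{12}$$

where $\eta \ge 0$ and $\beta \ge 0$ are Lagrange multipliers. By taking
the derivative of $\mathcal{L}(\mathbf{M}, \eta, \xi, \beta)$ with respect to $\mathbf{M}$, we have
the following:

$$\frac{\partial \mathcal{L}(\mathbf{M}, \eta, \xi, \beta)}{\partial \mathbf{M}} = \mathbf{M} - \mathbf{M}^{k} + \eta\,\frac{\partial\left[D_{\mathbf{M}}(\mathbf{p}, \mathbf{p}^{+}) - D_{\mathbf{M}}(\mathbf{p}, \mathbf{p}^{-})\right]}{\partial \mathbf{M}}. \tag{13}$$

Mathematically, $\frac{\partial[D_{\mathbf{M}}(\mathbf{p}, \mathbf{p}^{+}) - D_{\mathbf{M}}(\mathbf{p}, \mathbf{p}^{-})]}{\partial \mathbf{M}}$ can be formulated
as:

$$\frac{\partial\left[D_{\mathbf{M}}(\mathbf{p}, \mathbf{p}^{+}) - D_{\mathbf{M}}(\mathbf{p}, \mathbf{p}^{-})\right]}{\partial \mathbf{M}} = \mathbf{a}_{+}\mathbf{a}_{+}^{T} - \mathbf{a}_{-}\mathbf{a}_{-}^{T}, \tag{14}$$

where $\mathbf{a}_{+} = \mathbf{p} - \mathbf{p}^{+}$ and $\mathbf{a}_{-} = \mathbf{p} - \mathbf{p}^{-}$. Therefore, the
optimal $\mathbf{M}^{k+1}$ is obtained by setting $\frac{\partial \mathcal{L}(\mathbf{M}, \eta, \xi, \beta)}{\partial \mathbf{M}}$ to zero.
As a result, the following relation holds:

$$\mathbf{M}^{k+1} = \mathbf{M}^{k} + \eta(\mathbf{a}_{-}\mathbf{a}_{-}^{T} - \mathbf{a}_{+}\mathbf{a}_{+}^{T}). \tag{15}$$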
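Subsequently, we take the derivative of the Lagrangian (12)
with respect to $\xi$ and set it to zero:

$$\frac{\partial \mathcal{L}(\mathbf{M}, \eta, \xi, \beta)}{\partial \xi} = C - \beta - \eta = 0. \tag{16}$$

Clearly, $\beta \ge 0$ leads to the fact that $\eta \le C$. For notational
simplicity, $\mathbf{a}_{-}\mathbf{a}_{-}^{T} - \mathbf{a}_{+}\mathbf{a}_{+}^{T}$ is abbreviated as $\mathbf{U}$ hereinafter.
By substituting Equs. (15) and (16) into Equ. (12)
with $\mathbf{M} = \mathbf{M}^{k+1}$, we have:

$$\mathcal{L}(\eta) = \frac{1}{2}\eta^{2}\|\mathbf{U}\|_F^{2} + \eta\left(1 + D_{\mathbf{M}^{k+1}}(\mathbf{p}, \mathbf{p}^{+}) - D_{\mathbf{M}^{k+1}}(\mathbf{p}, \mathbf{p}^{-})\right), \tag{17}$$

where $D_{\mathbf{M}^{k+1}}(\mathbf{p}, \mathbf{p}^{+}) = \mathbf{a}_{+}^{T}(\mathbf{M}^{k} + \eta\mathbf{U})\mathbf{a}_{+}$ and
$D_{\mathbf{M}^{k+1}}(\mathbf{p}, \mathbf{p}^{-}) = \mathbf{a}_{-}^{T}(\mathbf{M}^{k} + \eta\mathbf{U})\mathbf{a}_{-}$. As a result, $\mathcal{L}(\eta)$ can be
reformulated as: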


Algorithm 3 Time-weighted reservoir sampling

Input: Current buffers $B_f$ and $B_b$ together with their corresponding keys,
a new training sample $\mathbf{p}$, maximum buffer size $\Omega$, time-weighted
factor $q$.

Output: Updated buffers $B_f$ and $B_b$ together with their corresponding keys.

1. Obtain the samples $\mathbf{p}_f^{*} \in B_f$ and $\mathbf{p}_b^{*} \in B_b$ with the smallest
keys $k_f^{*}$ and $k_b^{*}$ in $B_f$ and $B_b$, respectively.

2. Compute the time-related weight $w = q^{l}$ with $l$ being the corresponding
frame index number of $\mathbf{p}$.

3. Calculate a key $k = u^{1/w}$ where $u \sim \mathrm{rand}(0, 1)$.

4. Case: $\mathbf{p} \in$ foreground

if $|B_f| < \Omega$ then

• $B_f = B_f \cup \{\mathbf{p}\}$.

else

• $\mathbf{p}_f^{*}$ is replaced with $\mathbf{p}$ if $k > k_f^{*}$.
endif

Case: $\mathbf{p} \in$ background
if $|B_b| < \Omega$ then
• $B_b = B_b \cup \{\mathbf{p}\}$.

else

• $\mathbf{p}_b^{*}$ is replaced with $\mathbf{p}$ if $k > k_b^{*}$.
endif

5. Return $B_f$ and $B_b$ together with their corresponding keys.



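$$\mathcal{L}(\eta) = \lambda_2\eta^{2} + \lambda_1\eta + \lambda_0, \tag{18}$$

where $\lambda_2 = \frac{1}{2}\|\mathbf{U}\|_F^{2} + \mathbf{a}_{+}^{T}\mathbf{U}\mathbf{a}_{+} - \mathbf{a}_{-}^{T}\mathbf{U}\mathbf{a}_{-}$,
$\lambda_1 = 1 + \mathbf{a}_{+}^{T}\mathbf{M}^{k}\mathbf{a}_{+} - \mathbf{a}_{-}^{T}\mathbf{M}^{k}\mathbf{a}_{-}$, and $\lambda_0 = 0$. To obtain the optimal
$\eta$, we need to differentiate $\mathcal{L}(\eta)$ with respect to $\eta$ and set it
to zero:

$$\frac{\partial \mathcal{L}(\eta)}{\partial \eta} = \eta\left(\|\mathbf{U}\|_F^{2} + 2\mathbf{a}_{+}^{T}\mathbf{U}\mathbf{a}_{+} - 2\mathbf{a}_{-}^{T}\mathbf{U}\mathbf{a}_{-}\right) + \left(1 + \mathbf{a}_{+}^{T}\mathbf{M}^{k}\mathbf{a}_{+} - \mathbf{a}_{-}^{T}\mathbf{M}^{k}\mathbf{a}_{-}\right) = 0. \tag{19}$$

As a result, the following relation holds:

$$\eta = -\frac{1 + \mathbf{a}_{+}^{T}\mathbf{M}^{k}\mathbf{a}_{+} - \mathbf{a}_{-}^{T}\mathbf{M}^{k}\mathbf{a}_{-}}{\|\mathbf{U}\|_F^{2} + 2\mathbf{a}_{+}^{T}\mathbf{U}\mathbf{a}_{+} - 2\mathbf{a}_{-}^{T}\mathbf{U}\mathbf{a}_{-}}. \tag{20}$$

Due to the constraint of $0 \le \eta \le C$, $\eta$ should take the
following value:

$$\eta = \min\left\{C, \max\left\{0, \frac{1 + \mathbf{a}_{+}^{T}\mathbf{M}^{k}\mathbf{a}_{+} - \mathbf{a}_{-}^{T}\mathbf{M}^{k}\mathbf{a}_{-}}{2\mathbf{a}_{-}^{T}\mathbf{U}\mathbf{a}_{-} - 2\mathbf{a}_{+}^{T}\mathbf{U}\mathbf{a}_{+} - \|\mathbf{U}\|_F^{2}}\right\}\right\}. \tag{21}$$
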

The complete procedure of online distance metric learning 
is summarized in Algorithm 2. 
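One passive-aggressive step of Algorithm 2 can be sketched as follows. The function names, the value of $C$, and the test triplets are assumptions for illustration (the text does not report $C$). The sketch follows the "no eigendecomposition" variant, so the updated matrix is not projected back onto the positive semidefinite cone.

```python
import numpy as np

def hinge_loss(M, p, p_pos, p_neg):
    """Equ. (9): max{0, 1 + D_M(p, p+) - D_M(p, p-)}."""
    a_plus, a_minus = p - p_pos, p - p_neg
    return max(0.0, 1.0 + float(a_plus @ M @ a_plus) - float(a_minus @ M @ a_minus))

def triplet_update(M, p, p_pos, p_neg, C=0.1):
    """One step of Algorithm 2; returns the updated metric and the step size eta."""
    a_plus, a_minus = p - p_pos, p - p_neg
    margin = 1.0 + float(a_plus @ M @ a_plus) - float(a_minus @ M @ a_minus)
    if margin <= 0.0:
        return M, 0.0                                  # passive: constraint already met
    U = np.outer(a_minus, a_minus) - np.outer(a_plus, a_plus)
    # Denominator of Equ. (21); algebraically it equals ||U||_F^2.
    denom = 2.0 * a_minus @ U @ a_minus - 2.0 * a_plus @ U @ a_plus - np.sum(U * U)
    if denom <= 0.0:
        return M, 0.0                                  # degenerate triplet (U = 0)
    eta = min(C, max(0.0, margin / denom))
    return M + eta * U, eta

# Illustrative triplet that violates the margin under M = I (hypothetical values).
rng = np.random.default_rng(4)
d = 5
M0 = np.eye(d)
p = rng.standard_normal(d)
p_pos = p + 2.0 * rng.standard_normal(d)               # same class, but far away
p_neg = p + 0.1 * rng.standard_normal(d)               # other class, but close

M1, eta = triplet_update(M0, p, p_pos, p_neg, C=1e6)    # large C: step is not clipped
```

Since $\mathbf{a}_{+}^{T}\mathbf{U}\mathbf{a}_{+} - \mathbf{a}_{-}^{T}\mathbf{U}\mathbf{a}_{-} = -\|\mathbf{U}\|_F^{2}$, the unclipped step equals the current hinge loss divided by $\|\mathbf{U}\|_F^{2}$, and it brings the hinge loss of the triplet to zero.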

2.3. Time-weighted reservoir sampling 

We compute a linear representation solution (Equ. (2)) 
for two separate sample buffers consisting of foreground 
and background basis samples. Ideally, the sample buffers 
should keep a balance between sample diversity and adapt- 
ability. Motivated by this, reservoir sampling [18-21] is 
proposed for sequential random sampling. In principle, it 
aims to randomly draw some samples from a large pop- 
ulation of samples that come in a sequential manner. A 
classical version of reservoir sampling is able to effectively 
simulate the process of uniform random sampling [18, 19]. 
However, it is inappropriate for visual tracking because the 
samples used in visual tracking are dynamically distributed 
as time progresses. Usually, the samples occurring recently 






Algorithm 4 Metric-weighted linear representation based
visual tracking with time-weighted reservoir sampling

Input: Frame $t$, previous object state $Z_{t-1}^{*}$, previous distance metric
matrix $\mathbf{M}_{t-1}$, foreground buffer $B_f$ with its basis samples $\mathbf{P}_f$, background
buffer $B_b$ with its basis samples $\mathbf{P}_b$, number of particles $K$.
Output: Current object state $Z_t^{*}$, updated metric matrix $\mathbf{M}_t$, updated $B_f$
and $B_b$.

1: Sample a number of candidate object states $\{Z_t^{k}\}_{k=1}^{K}$ using the particle
filters (i.e., Gaussian dynamical model used in [1]).
2: Crop out the corresponding image regions of $\{Z_t^{k}\}_{k=1}^{K}$.
3: Extract the corresponding HOG feature set $\{\mathbf{y}_k\}_{k=1}^{K}$.
4: Perform the metric-weighted optimization in Equ. (1) with
$\min_{\mathbf{x}_f} g(\mathbf{x}_f; \mathbf{M}_{t-1}, \mathbf{P}_f, \mathbf{y}_k)$ and $\min_{\mathbf{x}_b} g(\mathbf{x}_b; \mathbf{M}_{t-1}, \mathbf{P}_b, \mathbf{y}_k)$.
5: Determine the optimal object state $Z_t^{*}$ by the MAP (maximum a
posteriori) estimation in the particle filters, where the observation model is
defined in Equ. (7) such that $p(\mathbf{y}_k \mid Z_t^{k}) \propto S(\mathbf{y}_k)$.
6: Collect new foreground and background samples $V_f \cup V_b$ according
to the spatial distance-based mechanism of training sample selection.
7: Carry out time-weighted reservoir sampling in Algorithm 3 to iteratively
update $B_f$ and $B_b$ with new training samples from $V_f \cup V_b$.
8: Perform the triplet sampling procedure (s.t. intra-class relevance and
inter-class irrelevance) in [17] over $B_f \cup B_b$ to generate a triplet set
$Q = \{(\mathbf{p}, \mathbf{p}^{+}, \mathbf{p}^{-})\}$.
9: Run online metric learning in Algorithm 2 to update $\mathbf{M}_{t-1}$ for each
triplet in $Q$, and finally obtain $\mathbf{M}_t$. This step can be performed every
few frames.
10: Return $Z_t^{*}$, $\mathbf{M}_t$, $B_f$, and $B_b$.



have a greater influence on the current tracking process than 
those appearing a long time ago. Therefore, larger weights 
should be assigned to the recently added samples while 
smaller weights should be attached to the old samples.
Inspired by [20,21], we design a time-weighted reservoir 
sampling (TWRS) method for randomly drawing the sam- 
ples according to their time-varying properties, as listed in 
Algorithm 3. The designed TWRS method is capable of ef- 
fectively maintaining the sample buffers for online metric 
learning in Sec. 2.2. 

By integrating the above-mentioned three modules (i.e., 
metric-weighted linear representation, online metric learn- 
ing, and time-weighted reservoir sampling) into a particle 
filtering framework, we obtain a visual tracker whose com- 
plete procedure is shown in Algorithm 4. 
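The buffer update of Algorithm 3 can be sketched for a single buffer as follows; a tracker would keep one such buffer for the foreground and one for the background. The function name `twrs_update`, the list-of-pairs buffer layout, and the demo parameters are assumptions for illustration. The key is computed as $u^{q^{-l}}$, which equals $u^{1/w}$ with $w = q^{l}$ but avoids overflow of $q^{l}$ for large frame indices.

```python
import random

def twrs_update(buffer, sample, frame_index, max_size, q=1.6, rng=random):
    """Insert `sample` into `buffer` (a list of (key, sample) pairs) per Algorithm 3.

    Returns True if the sample was stored, False if it was discarded.
    """
    u = rng.random()                          # u ~ rand(0, 1)
    key = u ** (q ** -frame_index)            # k = u^(1/w) with w = q^l
    if len(buffer) < max_size:
        buffer.append((key, sample))          # buffer not full: always keep
        return True
    j = min(range(len(buffer)), key=lambda idx: buffer[idx][0])
    if key > buffer[j][0]:                    # replace the smallest-key sample
        buffer[j] = (key, sample)
        return True
    return False

# Illustrative run: one sample per frame, each sample labelled by its frame index.
rng = random.Random(0)
buf = []
for frame in range(200):
    twrs_update(buf, sample=frame, frame_index=frame, max_size=20, q=1.05, rng=rng)
kept_frames = sorted(s for _, s in buf)
```

Because the weight grows geometrically with the frame index, the retained samples are biased toward recent frames while older samples still survive with small probability, which is the diversity and adaptability trade-off described above.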

3. Experiments 

Experimental setup In order to evaluate the proposed 
tracking algorithm, we conduct a set of experiments on 
thirteen challenging video sequences consisting of 8-bit
grayscale images. These video sequences are captured from 
different scenes, and contain a variety of object motion 
events (e.g., human walking and car running). 

Figure 1: Quantitative evaluation of the proposed tracker using different buffer sizes
on five video sequences (i.e., "cubicle", "trace", "BalanceBeam", "Walk", and "seq-jd").
The left and right subfigures correspond to the tracking performance of the
proposed tracking algorithm in VOR and CLE, respectively.

Figure 2: Quantitative evaluation of the proposed tracker using different particle
numbers on three video sequences (i.e., "iceball", "trellis70", and "seq-jd"). The left
and right subfigures are associated with the tracking performance in average VOC
success rate and tracking duration for each frame, respectively.

The proposed tracking algorithm is implemented in Matlab
on a workstation with an Intel Core 2 Duo 2.66 GHz
processor and 3.24 GB of RAM. The average running time of
the proposed tracking algorithm is about 0.55 seconds per
frame. For the sake of computational efficiency, we simply
consider the object state information in 2D translation
and scaling in the particle filtering module, where the
corresponding variance parameters are set to (10, 10, 0.1). The
number of particles is set to 200. For each particle, there
is a corresponding image region represented as a HOG feature
descriptor (see [26]; efficiently computed by
using integral histograms) with $3 \times 3$ cells (each cell is
represented by a 9-dimensional histogram vector) in the five
spatial block-division modes (as in [27]), resulting in a
405-dimensional feature vector for the image region. The
number of triplets used for online metric learning is chosen as
500. The maximum buffer size $\Omega$ and time-weighted factor
$q$ in Algorithm 3 are set to 300 and 1.6, respectively. The
scaling factors $\gamma_f$ and $\gamma_b$ in Equ. (7) are chosen as 1. The
trade-off control factor $\rho$ in Equ. (7) is set to 0.1. Note that
the aforementioned parameters are fixed throughout all the
experiments.
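As a quick check of the stated feature dimensionality, under the assumption (not spelled out in the text) that each of the five block-division modes contributes one $3 \times 3$ grid of 9-bin cell histograms:

$$\underbrace{5}_{\text{modes}} \times \underbrace{3 \times 3}_{\text{cells per mode}} \times \underbrace{9}_{\text{bins per cell}} = 405.$$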

To demonstrate the effectiveness of the proposed
tracking algorithm, we compare it with other state-of-the-art
trackers both qualitatively and quantitatively.
These trackers are referred to as FragT (Fragment-based
tracker [28]), MILT (multiple instance boosting-based
tracker [29]), VTD (visual tracking decomposition [3]),
OAB (online AdaBoost [30]), IPCA (incremental PCA [1]),
L1T ($\ell_1$ minimization tracker [2]), and DMLT (discriminative
metric learning tracker [15]). In the experiments,
some of the aforementioned trackers are implemented using
their publicly available source code, including FragT,
MILT, VTD, OAB, IPCA, and L1T. For OAB, there are two
different versions (namely, OAB1 and OAB5), which are
based on two different configurations (i.e., the search scale
r = 1 and r = 5 as in [29]). For quantitative performance



comparison, two popular evaluation criteria are introduced,
namely, center location error (CLE) and VOC overlap ratio
(VOR) between the predicted bounding box $B_p$ and ground
truth bounding box $B_{gt}$ such that $\mathrm{VOR} = \frac{\mathrm{area}(B_p \cap B_{gt})}{\mathrm{area}(B_p \cup B_{gt})}$.
If the VOC overlap ratio is larger than 0.5, then tracking is
considered successful.
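The two criteria and the resulting success rate can be computed as in the following sketch. The box convention `(x, y, w, h)` with a top-left corner and the function names are assumptions for illustration; the text does not specify a box format.

```python
import math

def center_location_error(box_p, box_gt):
    """CLE: Euclidean distance between box centers; boxes are (x, y, w, h)."""
    cxp, cyp = box_p[0] + box_p[2] / 2.0, box_p[1] + box_p[3] / 2.0
    cxg, cyg = box_gt[0] + box_gt[2] / 2.0, box_gt[1] + box_gt[3] / 2.0
    return math.hypot(cxp - cxg, cyp - cyg)

def voc_overlap_ratio(box_p, box_gt):
    """VOR: area of intersection divided by area of union."""
    x1, y1, w1, h1 = box_p
    x2, y2, w2, h2 = box_gt
    iw = max(0.0, min(x1 + w1, x2 + w2) - max(x1, x2))
    ih = max(0.0, min(y1 + h1, y2 + h2) - max(y1, y2))
    inter = iw * ih
    union = w1 * h1 + w2 * h2 - inter
    return inter / union if union > 0 else 0.0

def success_rate(pred_boxes, gt_boxes, threshold=0.5):
    """Fraction of frames whose VOR exceeds the threshold."""
    hits = sum(voc_overlap_ratio(p, g) > threshold for p, g in zip(pred_boxes, gt_boxes))
    return hits / len(pred_boxes)

# Illustrative boxes (hypothetical values).
gt = (0.0, 0.0, 10.0, 10.0)
shifted = (5.0, 0.0, 10.0, 10.0)
cle = center_location_error(shifted, gt)
vor = voc_overlap_ratio(shifted, gt)
rate = success_rate([gt, shifted], [gt, gt])
```

In this example the shifted box overlaps half of the ground truth, giving a VOR of 50/150, which falls below the 0.5 threshold and therefore counts as a failure.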

Effect of different buffer sizes We aim to investigate 
the effect of using different buffer sizes for visual tracking. 
Motivated by this, a quantitative evaluation of the proposed 
tracking algorithm is performed in nine different cases of 
buffer size. Meanwhile, we compute the average CLE and 
VOR for each video sequence in each case of buffer size. 
Fig. 1 shows the quantitative CLE and VOR performance 
on five video sequences. It is clear that the average CLE 
(VOR) decreases (increases) as the buffer size increases,
and plateaus once the buffer holds more than approximately 300 samples.

Evaluation of different particle numbers In general,
more particles enable visual trackers to locate the
object more accurately, but lead to a higher computational
cost. Thus, it is crucial for visual trackers to keep a good
balance between accuracy and efficiency using a moderate
number of particles. Motivated by this, we examine
the tracking performance of the proposed tracking algorithm
with respect to different particle numbers. The left
part of Fig. 2 shows the average VOC success rates (i.e.,
#success frames / #total frames) of the proposed tracking
algorithm on three video sequences. From the left part of
Fig. 2, we can see that the success rate grows rapidly with
the number of particles and finally converges. The right part of
Fig. 2 displays the average CPU time (spent by the proposed
tracking algorithm in each frame) with different particle
numbers. It is observed from the right part of Fig. 2
that the average CPU time increases slowly.

Performance with and without metric learning Met- 
ric learning is able to improve the intra-class compactness 
and inter-class separability of samples. In metric learning, 
three types of learning mechanisms can be used, including 
no eigendecomposition, step-by-step eigendecomposition, 
and final eigendecomposition [17]. To justify the effect of 
different metric learning mechanisms, we design several ex- 
periments on four video sequences. Fig. 3 shows the cor- 
responding experimental results of different metric learning 
mechanisms in both CLE and VOR on two of the four video 
sequences (note that the results for the other two video se- 
quences can be found in the supplementary file). Tab. 1 re- 
ports the average success rates of different metric learning 
mechanisms on the four video sequences. From Fig. 3 and 
Tab. 1, we can see that the performance of metric learning 
is better than that of no metric learning. In addition, the per- 
formance of metric learning with no eigendecomposition is 
close to that of metric learning with step-by-step eigende- 
composition, and better than that of metric learning with 



[Figure 3 plots: CLE and VOR versus frame index on the "cubicle" and "football" sequences; curves: ML w/o eigen, ML with final eigen, ML with step-by-step eigen, and No ML.]

Figure 3: Quantitative evaluation of the proposed tracker with/without metric learn- 
ing on two video sequences. The top two subfigures are associated with the tracking 
performance in CLE and VOR on the "cubicle" video sequence, respectively; the bot- 
tom two subfigures correspond to the tracking performance in CLE and VOR on the 
"football" video sequence, respectively. 





                              cubicle   football   iceball   trellis70
ML w/o eigen                  0.98      0.88       0.93      0.98
ML with final eigen           0.94      0.74       0.90      0.94
ML with step-by-step eigen    0.98      0.90       0.95      0.99
No metric learning            0.86      0.36       0.88      0.91

Table 1: Quantitative evaluation of the proposed tracker with/without metric learning
on four video sequences. The table reports their average success rates for each video
sequence.



final eigendecomposition. Therefore, the obtained results 
are consistent with those in [17]. Besides, metric learning 
with step-by-step eigendecomposition is much slower than 
that with no eigendecomposition which is adopted by the 
proposed tracking algorithm. 

Comparison of different linear representations The 
objective of this task is to evaluate the performance of four 
types of linear representations including our linear representation
with metric learning, our linear representation without
metric learning, compressive sensing linear representation
[4], and $\ell_1$-regularized linear representation [2]. For a
fair comparison, we utilize the raw pixel features which are
the same as [4,2]. Fig. 4 shows the performance of these
four linear representation methods in CLE on four video
sequences. Clearly, our linear representation with metric
learning achieves a lower CLE in
most frames than the three other linear representations.

Evaluation of different sampling methods Reservoir 
sampling [18] addresses the problem of randomly drawing 
the uniformly distributed samples in a sequential manner. 
Following the work of [18], a weighted version of reservoir
sampling is proposed in [21], which assigns different
weights to the samples occurring at different time points.
Based on this weighted reservoir sampling method, the proposed
tracking algorithm is capable of adaptively updating
the sample buffer as tracking proceeds. Here, we aim to
examine the performance of the two sampling methods. Fig. 5
shows the experimental results of the two sampling methods
in CLE on four video sequences (note that the VOR
results for these four video sequences can be found in the
supplementary file). From Fig. 5, we can see that weighted
reservoir sampling performs better than ordinary reservoir
sampling.

Figure 4: Quantitative comparison of different linear representation methods in CLE
on four video sequences (i.e., "football3", "seq-jd", "trace", and "Walk").

Comparison of competing trackers Fig. 6 plots the 
frame-by-frame center location errors (highlighted in differ- 
ent colors) obtained by the nine trackers for the first eight 
video sequences. Tab. 2 reports the success rates of the 
nine trackers over the thirteen video sequences. From Fig. 6 
and Tab. 2, we observe that the proposed tracking algorithm 
achieves the best tracking performance on most video se- 
quences. 
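
For concreteness, the two evaluation measures can be sketched as follows. The sketch assumes boxes given as `(x, y, width, height)`, takes the overlap ratio to be intersection-over-union, and counts a frame as successful when the overlap exceeds 0.5; the function names and the threshold are illustrative assumptions rather than a specification of our evaluation code.

```python
import math


def center_location_error(pred_box, gt_box):
    """Euclidean distance (in pixels) between the centers of two boxes."""
    px = pred_box[0] + pred_box[2] / 2.0
    py = pred_box[1] + pred_box[3] / 2.0
    gx = gt_box[0] + gt_box[2] / 2.0
    gy = gt_box[1] + gt_box[3] / 2.0
    return math.hypot(px - gx, py - gy)


def overlap_ratio(pred_box, gt_box):
    """Intersection-over-union of two (x, y, width, height) boxes."""
    left = max(pred_box[0], gt_box[0])
    top = max(pred_box[1], gt_box[1])
    right = min(pred_box[0] + pred_box[2], gt_box[0] + gt_box[2])
    bottom = min(pred_box[1] + pred_box[3], gt_box[1] + gt_box[3])
    inter = max(0.0, right - left) * max(0.0, bottom - top)
    union = pred_box[2] * pred_box[3] + gt_box[2] * gt_box[3] - inter
    return inter / union if union > 0 else 0.0


def success_rate(pred_boxes, gt_boxes, threshold=0.5):
    """Fraction of frames whose overlap ratio exceeds the assumed threshold."""
    if len(pred_boxes) != len(gt_boxes) or not gt_boxes:
        raise ValueError("box lists must be non-empty and of equal length")
    hits = sum(
        1 for p, g in zip(pred_boxes, gt_boxes) if overlap_ratio(p, g) > threshold
    )
    return hits / len(gt_boxes)
```

A lower center location error and a higher success rate both indicate better tracking.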

Table 2: Quantitative comparison of the nine trackers over the thirteen
video sequences. The table reports their tracking success rates on each video
sequence.

Discussion. Overall, the proposed tracking algorithm has
the following properties. First, once the buffer size exceeds
a certain value (around 300 in our experiments), the tracking
performance remains stable as the buffer size increases,
as shown in Fig. 1. This is desirable since we do not need a
large buffer to achieve promising performance. Second,
in contrast to many existing particle filtering-based trackers
whose running time is typically linear in the number of particles,
our method's running time is sublinear in the number
of particles, as shown in Fig. 2. Moreover, its tracking
performance rapidly improves and finally converges to a certain
value, as shown in Fig. 2. Third, as shown in Fig. 3
and Tab. 1, the performance of our metric learning with
no eigendecomposition is close to that of the computationally
expensive metric learning with step-by-step eigendecomposition.
Fourth, the linear representation with metric
learning yields better tracking accuracy, as shown in
Fig. 4. Fifth, it utilizes weighted reservoir sampling to effectively
maintain and update the foreground and background
sample buffers for metric learning, as shown in Fig. 5. Last,
compared with other state-of-the-art trackers, it is capable
of effectively adapting to complicated appearance changes
in the tracking process by constructing an effective metric-weighted
linear representation with weighted reservoir sampling,
as shown in Fig. 6 and Tab. 2.

4. Conclusion 

We have proposed a robust visual tracker based on
non-sparse linear representations, which can be solved extremely
efficiently in closed form. Compared with recent
sparse linear representation-based trackers [2,4], our tracker,
even with this simple implementation, is already much
faster with comparable accuracy. To further improve the
discriminative capacity of the linear representation, we
have presented online Mahalanobis distance metric learning,
which is able to capture the correlation information between
feature dimensions. We empirically show that incorporating
a metric into the linear representation considerably
improves the robustness of the tracker. To make the online
metric learning even more efficient, we design, for the first
time, a learning mechanism based on time-weighted reservoir
sampling. With this mechanism, recently streamed
samples in the video are assigned larger importance weights.
We have also theoretically proved that metric learning based
on the proposed reservoir sampling with limited-sized sampling
buffers can effectively approximate metric learning
using all the received training samples. In comparisons with
several state-of-the-art trackers on thirteen challenging sequences,
we empirically show that our method is more robust
to complicated appearance changes, pose variations,
and occlusions.

Acknowledgments. This work is supported in part by
ARC Discovery Project (DP1094764).

References 

[1] D. A. Ross, J. Lim, R. Lin, and M. Yang, "Incremental 
learning for robust visual tracking," Int. J. Comp. Vis., 
vol. 77, no. 1, pp. 125-141, 2008. 

[2] X. Mei and H. Ling, "Robust visual tracking and ve- 
hicle classification via sparse representation," IEEE 
Trans. Pattern Anal. Mach. Intell., 2011. 

[3] J. Kwon and K. M. Lee, "Visual tracking decomposition,"
in Proc. IEEE Conf. Comp. Vis. Patt. Recogn.,
2010, pp. 1269-1276.

Figure 5: Quantitative comparison of different sampling methods in CLE on four video sequences (i.e., "coke11", "Lola", "trace", and "Walk"). Before exceeding the buffer size
limit (approximately occurring between frame 40 and frame 50), the performances of different sampling methods are identical.

Figure 6: Quantitative comparison of different trackers in CLE on the first eight video sequences.

[4] H. Li, C. Shen, and Q. Shi, "Real-time visual track- 
ing with compressive sensing," in Proc. IEEE Conf. 
Comp. Vis. Patt. Recogn., 2011. 

[5] S. Hare, A. Saffari, and P.H.S. Torr, "Struck: Struc- 
tured output tracking with kernels," in Proc. IEEE Int. 
Conf. Comp. Vis., 2011. 

[6] X. Li, W. Hu, Z. Zhang, X. Zhang, and G. Luo, "Ro- 
bust visual tracking based on incremental tensor sub- 
space learning," in Proc. IEEE Int. Conf. Comp. Vis., 
2007, pp. 1-8. 

[7] X. Li, A. Dick, H. Wang, C. Shen, and A. van den
Hengel, "Graph mode-based contextual kernels for
robust SVM tracking," in Proc. IEEE Int. Conf. Comp.
Vis., 2011, pp. 1156-1163.

[8] X. Li, W. Hu, H. Wang, and Z. Zhang, "Robust object 
tracking using a spatial pyramid heat kernel structural 
information representation," Neurocomputing, vol. 73, 
no. 16-18, pp. 3179-3190, 2010. 

[9] C. Shen, J. Kim, and H. Wang, "Generalized kernel- 
based visual tracking," IEEE Trans. Circuits & Sys- 
tems for Video Tech., vol. 20, no. 1, pp. 119-130, 
2010. 

[10] Q. Shi, A. Eriksson, A. van den Hengel, and C. Shen,
"Is face recognition really a compressive sensing
problem?" in Proc. IEEE Conf. Comp. Vis. Patt.
Recogn., 2011.

[11] R. Rigamonti, M. A. Brown, and V. Lepetit, "Are 
sparse representations really relevant for image clas- 
sification?," in Proc. IEEE Conf. Comp. Vis. Patt. 
Recogn., 2011, pp. 1545-1552. 

[12] L. Zhang, M. Yang, and X. Feng, "Sparse representa- 
tion or collaborative representation: Which helps face 
recognition?," in Proc. IEEE Int. Conf. Comp. Vis., 
2011. 

[13] K.Q. Weinberger, J. Blitzer, and L.K. Saul, "Distance 
metric learning for large margin nearest neighbor clas- 
sification," in Proc. Adv. Neural Inf. Process. Syst., 
2006. 

[14] C. Shen, J. Kim, L. Wang, and A. van den Hengel, 
"Positive semidefinite metric learning with boosting," 
in Proc. Adv. Neural Inf. Process. Syst., 2009, pp. 
1651-1659. 

[15] X. Wang, G. Hua, and T. Han, "Discriminative track- 
ing by metric learning," Proc. Eur. Conf. Comp. Vis., 
pp. 200-214, 2010. 

[16] N. Jiang, W. Liu, and Y. Wu, "Adaptive and discrim- 
inative metric differential tracking," in Proc. IEEE 
Conf. Comp. Vis. Patt. Recogn., 2011, pp. 1161-1168. 

[17] G. Chechik, V. Sharma, U. Shalit, and S. Bengio,
"Large scale online learning of image similarity
through ranking," J. Mach. Learn. Research, vol. 11,
pp. 1109-1135, 2010.

[18] J. S. Vitter, "Random sampling with a reservoir," ACM
Trans. Math. Software, vol. 11, no. 1, pp. 37-57, 1985.

[19] P. Zhao, S.C.H. Hoi, R. Jin, and T. Yang, "Online
AUC maximization," in Proc. Int. Conf. Mach. Learn.,
2011.

[20] M. Kolonko and D. Wasch, "Sequential reservoir sam- 
pling with a non-uniform distribution," ACM Trans. 
Math. Software, vol. 32, pp. 257-273, 2004. 

[21] P. S. Efraimidis and P. G. Spirakis, "Weighted random
sampling with a reservoir," Information Processing
Letters, vol. 97, no. 5, pp. 181-185, 2006.

[22] A. Jennings and J. McKeown, Matrix computation, 
John Wiley & Sons Inc., 1992. 

[23] A. S. Householder, The theory of matrices in numer- 
ical analysis, Blaisdell Publishing Co.: New York, 
1964. 

[24] M. J. D. Powell, "A theorem on rank one modifications 
to a matrix and its inverse," The Computer Journal, 
vol. 12, no. 3, pp. 288-290, 1969. 

[25] K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, 
and Y. Singer, "Online passive-aggressive algo- 
rithms," J. Mach. Learn. Research, vol. 7, pp. 551- 
585, 2006. 

[26] N. Dalal and B. Triggs, "Histograms of oriented gradients
for human detection," in Proc. IEEE Conf. Comp.
Vis. Patt. Recogn., 2005.

[27] X. Li, W. Hu, Z. Zhang, X. Zhang, M. Zhu, and
J. Cheng, "Visual tracking via incremental log-Euclidean
Riemannian subspace learning," in Proc.
IEEE Conf. Comp. Vis. Patt. Recogn., 2008, pp. 1-8.

[28] A. Adam, E. Rivlin, and I. Shimshoni, "Ro- 
bust fragments-based tracking using the integral his- 
togram," in Proc. IEEE Conf. Comp. Vis. Patt. 
Recogn., 2006, pp. 798-805. 

[29] B. Babenko, M. Yang, and S. Belongie, "Visual track- 
ing with online multiple instance learning," in Proc. 
IEEE Conf. Comp. Vis. Patt. Recogn., 2009, pp. 983- 
990. 

[30] H. Grabner, M. Grabner, and H. Bischof, "Real-time
tracking via on-line boosting," in Proc. British Machine
Vis. Conf., 2006, pp. 47-56.



